11  Bioinformatics

Learning Objectives

After completing this lab you should be able to

Download the directory for this project here, make sure the directory is unzipped, and move it to your bi328 directory. You can open the Rproj for this module either by double-clicking on it (which will launch RStudio), by opening RStudio and then using File > Open Project, or by clicking on the R project icon in the top right of your program window and selecting Open Project.

Once you have opened a project you should see the project name in the top right corner1.

  • 1 Pro tip: If you run into issues where a quarto document won’t render or file paths aren’t working (especially if things were working previously) one of your first steps should be to double check that the correct Rproj is loaded.

  • There should be a file named 11_bioinformatics.qmd in that project directory. Use that file to work through this tutorial - you will hand in your rendered (“knitted”) quarto file as your homework assignment. So, as a first step, change the author in the YAML header to your name. You will use this quarto document to record your answers. Remember to use comments to annotate your code; at minimum you should have one comment per code set2, though you may of course add as many comments as you need to be able to recall what you did. Similarly, take notes in the document as we discuss discussion/reflection questions, but make sure that you go back and clean them up for “public consumption”.

  • 2 You should do this whether you are adding code yourself or using code from our manual, even if it isn’t commented in the manual… especially when the code is already included for you, add comments to describe how the function works/what it does as we introduce it during the participatory coding session so you can refer back to it.

  • CRAN (the Comprehensive R Archive Network) is the central archive for R packages. However, it is not uncommon for fields to have specialized, field-specific archives. For example, Bioconductor hosts many packages used for bioinformatics. After installing BiocManager you can install packages hosted on Bioconductor using BiocManager::install().

    Previously we have always called functions using just the function name. Given that you have loaded your packages, R automatically “knows” which package to call a function from. However, sometimes it might be helpful to specify which package a function comes from, either for documentation purposes (so that somebody else would know which library that function is from) or because there might be a function with the same name in more than one library, and this way you can tell R explicitly which function you are calling.
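As a small stand-alone illustration (using base-R functions that are always available, not functions from this module), the :: operator lets you call a function by its full name:

```r
# explicit namespacing with package::function()
x <- c(1, 2, 3, 4, 5)

stats::sd(x)   # call sd() explicitly from the stats package
base::mean(x)  # call mean() explicitly from base R
```

Both calls behave exactly like sd(x) and mean(x); the prefix only makes the source package explicit.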

    We will need to install some new packages before we can get rolling. Execute the following commands to install them - note that the code chunk option eval is set to false, which means that when you render your quarto document it won’t execute this code chunk and try to re-install these packages.

    # install Bioconductor
    install.packages("BiocManager")
    
    # install dependency for data
    BiocManager::install("ShortRead") 
    
    # install dada2
    BiocManager::install("dada2")
    
    # install Biostrings
    BiocManager::install("Biostrings")
    
    # if these packages are not yet installed install patchwork and plotly 
    install.packages(c("plotly", "patchwork"))

    Now let’s load our packages so we can get started.

    # load libraries ----
    
    # documentation
    library(knitr)
    
    # eDNA analysis
    library(dada2)
    
    
    # wrangling
    library(tidyr)
    library(dplyr)
    library(tibble)
    library(readr)
    library(glue)
    
    # visualization
    library(ggplot2)
    library(ggthemes)
    library(plotly)
    library(patchwork)

    11.1 Next-generation sequence data enable metabarcoding to describe community composition

    High-throughput sequencing data, often referred to as next-generation sequencing (NGS) data, is a type of biological data generated through advanced sequencing technologies that allow for the rapid and simultaneous sequencing of a large number of DNA or RNA molecules. These technologies have revolutionized genomics, transcriptomics, and other related fields by enabling the efficient and cost-effective analysis of genetic material on a massive scale. There are various high-throughput sequencing platforms available, including Illumina, Ion Torrent, Pacific Biosciences (PacBio), Oxford Nanopore Technologies (ONT), and others. High-throughput sequencing has become more cost-effective over time, making it accessible to a broader range of researchers and enabling large-scale genomics projects.

    The key difference between high-throughput sequencing platforms and traditional Sanger sequencing3 is that they are capable of simultaneously sequencing thousands to millions of DNA fragments or RNA molecules in a single sequencing run. This parallel processing greatly increases the speed and efficiency of sequencing.

  • 3 Sanger sequencing, also known as dideoxy sequencing or chain-termination sequencing, is a traditional DNA sequencing method that was developed by Frederick Sanger and his colleagues in the late 1970s. It was the primary method for DNA sequencing for several decades before the advent of high-throughput next-generation sequencing (NGS) technologies. Sanger sequencing is still used today for specific applications, such as sequencing individual genes or validating NGS results.

  • NGS platforms produce a vast amount of data. A single sequencing run can generate gigabytes to terabytes of raw sequencing data, depending on the instrument’s capacity and the type of experiment being conducted4. Typically, high-throughput sequencing generates short sequences, referred to as “reads.” The length of these reads can vary depending on the sequencing platform but is typically in the range of 50 to 300 base pairs. Usually you get one forward and one reverse read per sequenced template DNA.

  • 4 Don’t worry, we are going to use a data set consisting of samples that have been modified to reduce size while still preserving realistic sequence variation and errors, creating a small example data set that your laptops can handle.

  • In the context of metabarcoding of environmental DNA this means that we can use universal primers to amplify template DNA present in the environmental sample, potentially originating from different organisms5, because we can sequence all the amplified template DNA strands at once.

  • 5 If you used Sanger sequencing you would just get a bunch of noise, because the sequencer would not be able to make a distinct base call at any position.

  • 11.2 Bioinformatics pipelines are important tools for processing NGS data

    Analyzing high-throughput sequencing data requires sophisticated bioinformatics tools and pipelines. Bioinformatics pipelines are widely used in genomics and related fields to streamline and standardize data analysis workflows. They comprise a structured and automated series of computational and data analysis steps designed to process and analyze biological data. Frequently, data takes one of several paths through the pipeline, where the output from one step becomes the input for the next step.

    Generally, a bioinformatics pipeline takes specific type(s) of raw input data, such as DNA or RNA sequence data, frequently generated using high-throughput techniques. This data set then goes through the first stage of the pipeline, where it is pre-processed for the analysis steps, including quality control, data cleaning, and format conversion to ensure the data set is robust and meets a standardized format for downstream analysis. Pipelines frequently consist of a series of analysis steps or modules tailored to specific research questions/tasks, for example sequence alignment, variant calling, gene expression quantification, or protein structure prediction.

    Pipelines are designed to allow for a high degree of automation to reduce error and maximize efficiency. Frequently, users rely on scripts or specific workflow management tools to execute each analysis step in a predefined order. Because of this automation, it can be tempting to “blackbox” the analysis, where an inexperienced user can run a complex data analysis without understanding what is happening at each stage. However, for an analysis appropriate to a specific data set it is critical that scientists understand how to configure the parameters and options for each analysis step to customize the pipeline for the specific data set and research objectives.

    Bioinformatics tools are often run from the command line. Linux is frequently the operating system of choice6, as it provides a powerful and standardized command-line interface that makes it easy to automate tasks, write scripts, and integrate various tools into analysis pipelines. It also offers a high degree of customization and flexibility due to its open source nature7. In addition, it is known for its efficiency and stability, which is critical when running resource-intensive analyses.

  • 6 Your laptop likely runs MacOS or Windows; Linux is the third common OS. However, it is not designed to be as user-friendly as Mac/Windows and has a steep learning curve.

  • 7 This also means it’s free!

  • High performance clusters allow for efficient handling of large data sets by allowing multiple tasks to be executed in parallel8.

  • 8 Your laptop probably has two to four cores, which means you could run tasks in parallel on a handful of threads. Many Linux servers have upward of 20 threads to run things in parallel on; for high performance clusters (HPCs) you may be looking at 100s.

  • Don’t worry, you’re not going to have to learn how to use a Linux Terminal. We will be using a pipeline that runs in R.

    DADA2 (Callahan et al. 2016) is a bioinformatics pipeline that implements functions that can be used for the quality control, denoising, and analysis of amplicon sequencing data, particularly data generated from high-throughput sequencing technologies like Illumina’s MiSeq or HiSeq platforms. It is widely used in microbiome research and is becoming increasingly popular for the analysis of environmental DNA (eDNA) metabarcoding data sets.

    The key modules (steps) include

    1. Quality Filtering: DADA2 begins by assessing the quality of each sequence read in the dataset. Low-quality reads, which may contain sequencing errors or noise, are filtered out. This step helps improve the accuracy of downstream analysis.
    2. Sequence Dereplication: Identical sequences are collapsed into unique sequence variants (also known as amplicon sequence variants or ASVs). This step reduces redundancy in the dataset and accounts for potential PCR or sequencing errors.
    3. Denoising: DADA2 employs a statistical model to distinguish between true biological sequence variants and sequencing errors. By modeling the error profile of the sequencing data, it can identify and correct errors, resulting in more accurate sequence variants.
    4. Chimera Removal: DADA2 identifies and removes chimera sequences, which are artificial sequences generated during PCR that can lead to incorrect biological interpretations.
    5. Taxonomic Assignment: After denoising, DADA2 assigns taxonomy to the sequence variants by comparing them to reference databases, allowing researchers to identify the taxa present in the sample.

    In contrast to other pipelines used to process metabarcoding data sets, DADA2 generates ASVs.

    Consider this

    Explain the difference between an OTU and an ASV (include what each abbreviation stands for). You will want to revisit this question after completing this tutorial, when you have a better understanding of the methods implemented in DADA2.

    Did it!

    [Your answer here]

    Enough chit chat! Let’s go!

    11.3 Demultiplexing

    Generally, multiple samples are pooled and run on the same NGS sequencing lane. The amplified sequences (amplicons) of each sample are tagged with the same unique molecular barcode or index. This means that the first step is to take the entire output of a sequencing run and demultiplex it, i.e. each sequence read is assigned back to its respective sample based on the barcode/index combination.

    FASTQ files are a common file format used to store biological sequence data, particularly data generated by next-generation sequencing (NGS) technologies like Illumina sequencing platforms. In contrast to FASTA files which contain only sequence data, FASTQ files contain information about the sequences of DNA or RNA fragments, associated quality scores, and optional sequence identifiers.
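As a hypothetical example (made-up read ID and bases), a single FASTQ record spans four lines: an identifier line starting with @, the sequence, a + separator line, and one ASCII-encoded quality character per base. Here the character I encodes a PHRED score of 40 under the standard Phred+33 encoding:

```
@SEQ_001 example read
GATTTGGGGTTCAAAGCAGT
+
IIIIIIIIIIIIIIIIIIII
```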

    This data set has already been demultiplexed and the individual fastq files are in your data folder.

    11.4 Quality check and filter sequences

    If you look in the data folder of your project directory, you will see that you have a folder called seq that contains a folder for your raw sequences and also one that we will use to hold your sequences once we have done some quality filtering and trimming.

    Before we get started we are going to create a series of vectors that contain our file paths and sample names that we can use throughout the analysis9. We are also going to extract the sample names and save them in a variable. We are able to do this because all of our filenames have a similar pattern.

  • 9 There are some key advantages to setting up vectors in this way. First, you only need to type in the information once, and then you can just call the vector which is a lot shorter and easier to type in. Second, you make the code more easily reusable for another analysis because you only need to change the file paths in one location and don’t have to find all the spots in your code where the file path needs to be changed. Finally, as you will see you are working with a large number of individual files, so when we create sample names based on the file names we don’t have to type them in all by hand which saves time and also minimizes the chances of typos.

  • # designate file paths ----
    
    # create variable with initially trimmed reads
    raw <- "data/seq/raw"
    
    # path for filtered reads
    filt <- "data/seq/filt"
    
    
    # create lists of matched forward & reverse fastq files ----
    
    # forward reads
    fnFs <- sort(list.files(raw, pattern = "_R1.fastq", full.names = TRUE))
    
    # reverse reads
    fnRs <- sort(list.files(raw, pattern = "_R2.fastq", full.names = TRUE))
    
    # create vector with sample names
    sample.names <- substr(basename(fnFs), 1, (nchar(fnFs)-25))
    
    
    # create file names & paths for filtered reads ----
    
    # named vectors of forward reads 
    filtFs <- file.path(filt, glue("{sample.names}_R1.fastq.gz"))
    names(filtFs) <- sample.names
    
    # named vectors of reverse reads 
    filtRs <- file.path(filt, glue("{sample.names}_R2.fastq.gz"))
    names(filtRs) <- sample.names

    There are 74 samples in the data set.
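To see how sample names can be recovered from file names, here is a small stand-alone sketch with made-up file paths. It uses sub() rather than the substr() approach above, but the idea is the same: strip the directory and the shared suffix so only the sample name remains.

```r
# hypothetical fastq file paths following the SAMPLE_R1.fastq.gz pattern
paths <- c("data/seq/raw/sampleA_R1.fastq.gz",
           "data/seq/raw/sampleB_R1.fastq.gz")

# basename() drops the directory; sub() strips the shared "_R1..." suffix
sub("_R1.*$", "", basename(paths))  # "sampleA" "sampleB"
```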

    Now that we have all of our file paths set up, our next step is to take a look at the quality profile of the sequences. We could look at all of them; however, it is usually sufficient to spot check a few sequences to get an idea of what the quality looks like and what threshold values we should be using for quality filtering.

    DADA2 has a built-in function plotQualityProfile() that pulls the information from the fastq files containing the quality information for each nucleotide call. Quality is measured as PHRED scores: a score of 40 should be interpreted as an expected 1 in 10,000 error rate, while 20 would be a 1 in 100 chance of a base call being incorrect.
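The relationship between a PHRED score Q and the probability p of an incorrect base call is Q = -10 * log10(p). A minimal sketch (the helper function name is ours, not part of DADA2):

```r
# convert a PHRED quality score to the probability of an incorrect call
phred_to_error <- function(q) 10^(-q / 10)

phred_to_error(40)  # 1e-04, i.e. 1 error expected in 10,000 calls
phred_to_error(20)  # 0.01, i.e. 1 error expected in 100 calls
```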

    We are wrapping the plotting function in ggplotly(), which is built on the plotly package, to generate interactive figures. It will take a second for your figure to pop up in the Viewer pane of RStudio. If you hover over different points of the plot with your mouse you will see a little popup that lists the exact values.

    Let’s go ahead and plot the quality profile for the first sample. We can do this by passing the first file path in the fnFs vector, using [1] to indicate the first element of the vector.

    ggplotly(plotQualityProfile(fnFs[1]))

    Figure 11.1: Quality scores for the forward reads of the first sample. The grey-scale underlying heatmap shows the frequency of each score at each base position (darker color is higher frequency), the green line is the mean quality score for each base position, and the orange lines indicate the quartiles (solid is the median, dashed are the 25th and 75th percentiles). The red line indicates the percentage of reads that extend to at least that position.

    As you can see there is a lot going on in this figure. The x-axis indicates the cycle number, which is equivalent to the base position in the sequence (one base call is added per cycle), while the y-axis indicates the quality score. In the bottom left you can see the number of reads (sequences) in the file. The grey shading is a heatmap indicating the frequency of each score at each position; darker colors indicate that a certain quality score is more common at that position across all the reads in the sample. The green line is the mean quality score, the orange lines are the quartiles. The red line indicates the proportion of reads that extend to that position.

    Give it a whirl

    Go ahead and plot the reverse reads for the 12th sample in the data set. Then describe the quality patterns you are seeing for your forward and reverse reads. You can use the table below that summarizes how to interpret the quality scores.

    Did it!

    [Your answer here]

    ggplotly(plotQualityProfile(fnRs[12]))

    Figure 11.2: Quality scores for the reverse reads of the 12th sample. The grey-scale underlying heatmap shows the frequency of each score at each base position (darker color is higher frequency), the green line is the mean quality score for each base position, and the orange lines indicate the quartiles (solid is the median, dashed are the 25th and 75th percentiles). The red line indicates the percentage of reads that extend to at least that position.

    Reverse reads typically have lower quality and larger drop-off compared to forward reads. Trimming too conservatively can result in downstream issues when forward and reverse reads cannot be merged due to a lack of overlap.

    | Quality score of base call | Confidence of base call being correct (%) |
    |----------------------------|-------------------------------------------|
    | 10                         | 90                                        |
    | 20                         | 99                                        |
    | 30                         | 99.9                                      |
    | 40                         | 99.99                                     |

    11.5 Trim sequences

    We will use the information from the quality scores to make decisions on appropriate threshold values for trimming our sequences.

    We are going to use a series of filters to remove low quality reads from the samples. Specifically, we will remove any base calls that were too ambiguous to call10 and any remaining PhiX reads still in the data set11. Next, the first and last xx bases are trimmed for each read, as these are usually low quality. Additionally, reads are truncated at the first instance of a quality score < 6. Finally, reads < 35 bp after trimming are removed.

  • 10 N are unknown nucleotides. If the signal for a base is too ambiguous to make a call, the Illumina platform will call it N. DADA2 assumes there are no Ns in the data set, so we have to remove them.

  • 11 PhiX is a bacteriophage. Its DNA is spiked into libraries being sequenced to improve the quality of the sequencing run by increasing the sequence diversity.

  • We are assuming matching order of forward and reverse reads for each sample.

    Give it a whirl

    Look up the filterAndTrim() function and annotate each line of the code to indicate what each argument does; include both what the argument controls in general and specifically what this means for this example.

    # filter reads
    out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, #
                         truncLen = c(280, 280),     #
                         trimLeft = c(18, 2),        #
                         truncQ = 6,                 #
                         maxEE = c(2,2),             #
                         minLen = 35,                #
                         rm.phix = TRUE,             #
                         matchIDs = FALSE,           #
                         compress = TRUE)            #

    Here are descriptions of the key arguments that set thresholds for quality filters

    • truncQ: sets a minimum Q score. At the first instance of a quality score less than or equal to truncQ, the sequence is truncated.
    • truncLen: sets the length at which the sequences will be truncated. Sequences shorter than the length are eliminated.
    • trimLeft: sets the length that will be removed on the 5’ side of the reads. This allows you to remove the primers if it has not been done beforehand12.
    • maxEE: sets the maximum number of “expected errors” allowed in a read. This filter is based on the Q score. The higher the number, the less strict the filter.
  • 12 Primers used are ITS3_KYO2: GATGAAGAACGYAGYRAA (18 bp) and ITS4: TCCTCCGCTTATTGATATGC (20 bp)

  • Let’s take a look at what that function has done. We assigned the output to a variable called out.

    Give it a whirl

    Use the functions class() and head() to get an idea of what type of object we have created and then briefly describe what information is contained in that object.

    Did it!

    [Your answer here]

    Wait - that function has created a matrix array (a type of object you could describe as a multidimensional vector or a poor man’s data frame) that tells us how many reads per sample went into the filters and how many came out.

    But where did our filtered reads go? Remember, we gave the function the file paths for both the input files (data/seq/raw/*.fastq.gz) and where to write the filtered files and what to name them (data/seq/filt/*.fastq.gz). If you use the file navigation pane you should see that it now contains those files.

    The object we created that is now in our environment holds the record of what occurred during filtering. It is very common for bioinformatics pipelines to generate output files at various steps that let the user track what is happening. These records can be used to ensure everything is going as expected and also allow the user to pick up on unusual patterns that might indicate that the threshold values are not appropriate, that there is something odd going on with the data, or that commands were not properly formulated and therefore did not do what the researcher expected.

    Let’s take a look to see how many reads were removed from the data set.

    Give it a whirl

    This code chunk converts the matrix into a data frame. Use your coding skills to create a new column called perc_lost that contains the percent of reads that were lost in this filtering step. Then calculate the mean and standard deviation of the percent of reads lost and the mean and standard deviation of the number of reads that remain per sample.

    Then use kable() to display your table in the console/your rendered html report.

    Note the inline code below that is pulling the key pieces of information directly from your data frame. This can be really helpful because, e.g., if you are rerunning code on an updated data set you would not have to dig through your output to make sure that you updated all the results; instead R can do that for you. You indicate that it is inline code using backticks and then indicating the engine using r. You will see that notation throughout this document.
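For example, using the check data frame and perc_lost column from the exercise above, one such inline expression in the quarto source could look like this (the exact sentence is only an illustration):

```
After filtering, samples lost `r round(mean(check$perc_lost), 1)`% of their reads on average.
```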

    The function pull() allows you to extract (pull) a single value from a column.
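As a self-contained sketch with made-up read counts (the column names reads.in and reads.out match what filterAndTrim() reports, but the values are invented), this is the kind of mutate/summarise/pull pattern the exercise asks for:

```r
library(dplyr)

# toy version of the filtering record: reads per sample before and after
check_toy <- data.frame(reads.in  = c(1000, 800),
                        reads.out = c(500, 420))

# add a percent-lost column, then pull a single summary value
check_toy <- check_toy %>%
  mutate(perc_lost = (reads.in - reads.out) / reads.in * 100)

check_toy %>%
  summarise(mean_loss = mean(perc_lost)) %>%
  pull(mean_loss)  # 48.75
```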

    Did it!
    # summarize proportion of reads lost during trimming
    check <- as.data.frame(out) %>%
    | mean_loss | std_loss | mean_reads | std_reads |
    |-----------|----------|------------|-----------|
    | 49.05541  | 8.624355 | 509.4459   | 86.24355  |
    
    Table 11.1: Mean and standard deviation of the proportion of reads lost due to quality filtering and the mean and standard deviation of the number of remaining reads per sample post filtering.

    After quality trimming, samples contain approx. 509 (+/- 86) reads. Quality trimming removed 49.1% (+/- 8.6%) of reads from each sample.

    We will track how many reads we “lose” at each step to understand how different steps affect the number of reads that remain in the sample.

    In addition to knowing how many low quality reads were removed, we will want to take a look at the quality of the remaining reads after trimming.

    Give it a whirl

    Plot the quality scores for the forward and reverse of one random sample and compare your results to the raw data. Make sure to add a figure caption using the code chunk options!

    Did it!

    [Your answer here]

    This is what the plot for your forward reads should look like.

    Figure 11.3: Quality scores of forward reads for 16 random samples in the data set after trimming. The grey-scale underlying heatmap shows the frequency of each score at each base position (darker color is higher frequency), the green line is the mean quality score for each base position, and the orange lines indicate the quartiles (solid is the median, dashed are the 25th and 75th percentiles). The red line indicates the percentage of reads that extend to at least that position.

    Remember to plot the reverse reads for the same sample.

    Figure 11.4: Quality scores of reverse reads for 12 random samples in the data set after trimming. The grey-scale underlying heatmap shows the frequency of each score at each base position (darker color is higher frequency), the green line is the mean quality score for each base position, and the orange lines indicate the quartiles (solid is the median, dashed are the 25th and 75th percentiles). The red line indicates the percentage of reads that extend to at least that position.

    11.6 Generate error model

    Next, we need to be able to distinguish between errors due to, for example, PCR or sequencing and actual biological differences among sequences. We can train DADA2 to do this using a subset of our data as a training set, applying a machine learning approach to establish a parametric error model13 that estimates error rates.

  • 13 Note that the error rate is specific to a sequencing run

  • Error rates are generally expected to drop with increased quality. By default, 100 million bases are used to generate the error model; this number can be increased for a better estimate.

    Give it a whirl

    Run the code below to generate the error models. This will take a few seconds. While you wait, pull up the description of what this function does in the help pane and use comments to indicate what the function as a whole does and what the different arguments control.

    # forward reads
    errF <- learnErrors(filtFs,               # 
                        nbases = 1e8,         # 
                        multithread = FALSE)  # 
    9877138 total bases in 37699 reads from 74 samples will be used for learning the error rates.
    # reverse reads
    errR <- learnErrors(filtRs, 
                        nbases = 1e8,
                        multithread = FALSE)
    10480322 total bases in 37699 reads from 74 samples will be used for learning the error rates.

    Machine learning estimates and observed values should show close overlap to indicate the quality (fit) of the model. The observed error rates should be consistent with the expected learned error rates for the 16 possible base transitions. If they do not, the number of bases used can be increased to allow the ML algorithm to train on a larger subset of the data.

    We can use the function dada2::plotErrors() to compare the observed and estimated error plots as an indication of how good our error models are. We will plot both the forward and reverse error model and assign those to objects so we can use the patchwork package to plot them next to each other in a single plot14.

  • 14 the package patchwork allows us to combine multiple plots in one figure. You can read up on the basic layout options using +, / and | here and get a sneak peek of more complex layouts here

  • # Visualize estimated error
    p1 <- plotErrors(errF, nominalQ = TRUE)
    
    p2 <- plotErrors(errR, nominalQ = TRUE)
    
    p1 / p2

    Figure 11.5: Error rates for each possible transition for forward (top) and reverse reads (bottom). Observed (grey points) and estimated (black line) error rates for each consensus quality score. Expected error rates for nominal definition of the quality score are in red.

    The plot will pop up in the Plot pane. You can change the size of the pane to resize and make it larger, it can also be helpful to use the zoom button to create a popout window for an even better look.

    Note also that you can use the code chunk option fig-height: to control the size in your rendered html output file.

    As you can see this creates a series of plots - one for each possible transition, e.g. the error frequency for an A being mistakenly called a C (A2C), etc. The points are the observed error rates for each consensus quality score. The black line is the estimated error rate after the model has converged. The red line indicates the expected error rates under the nominal definition of the Q-value for Illumina technology.

    Consider this

    Use the figures to briefly describe how well our error models capture the observed data and how this compares to the expected error rates.

    Did it!

    [Your answer here]

    12 Infer amplicon sequence variants (ASVs) and create sequence table

    DADA2 uses exact variants instead of OTUs (operational taxonomic units)15. The pipeline infers ASVs from the set of reads by incorporating the consensus quality profiles and abundances of each unique sequence to determine whether a given sequence is more likely a true biological sequence or more likely spurious (Callahan et al. 2016; Rosen et al. 2012).

  • 15 An OTU is a cluster of DNA sequences considered to belong to the same taxonomic unit based on a predefined sequence similarity threshold. By contrast, an ASV is a unique sequence variant obtained from raw, high resolution sequence data; there is no clustering based on similarity. As a result, multiple ASVs (haplotypes) may correspond to the same species.

  • Callahan, Benjamin J., Paul J. McMurdie, Michael J. Rosen, Andrew W. Han, Amy Jo A. Johnson, and Susan P. Holmes. 2016. “DADA2: High-resolution Sample Inference from Illumina Amplicon Data.” Nature Methods 13 (7): 581–83. https://doi.org/10.1038/nmeth.3869.
    Rosen, Michael J., Benjamin J. Callahan, Daniel S. Fisher, and Susan P. Holmes. 2012. “Denoising PCR-amplified Metagenome Data.” BMC Bioinformatics 13 (1): 283. https://doi.org/10.1186/1471-2105-13-283.

    Running this step on all samples at the same time is computationally more intensive than running the pipeline one sample at a time, but it increases the function’s ability to resolve low-abundance ASVs, including singletons.

    12.1 Dereplicate, infer, and merge ASVs

    Now we are ready to identify the ASVs present in our data set.

    The first step is dereplication, which means that we collapse all identical reads into a set of unique sequences. Dereplicating or denoising data is an important step in amplicon processing workflows: instead of keeping all identical sequences, only one is kept for processing and the number of sequences it represents is stored along with it. A new consensus quality score profile is calculated for each unique sequence based on the average quality score of each base across all sequences that are replicates of that given amplicon. These consensus quality profiles inform the subsequent denoising step. Dereplication makes downstream processing a lot more efficient and less memory intensive by eliminating redundant comparisons.

    The function lapply() is similar to a for loop, where it applies the same function to every file in the vector containing the filtered samples.

    # dereplicate the filtered forward reads (one list element per sample)
    derepFs <- lapply(filtFs,
                      derepFastq,
                      verbose = FALSE)
    
    # dereplicate the filtered reverse reads
    derepRs <- lapply(filtRs,
                      derepFastq,
                      verbose = FALSE)
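    As a toy illustration of how lapply() replaces an explicit for loop (hypothetical data, base R only; table() stands in for derepFastq()):

```r
# A toy stand-in for the vector of filtered files: a named list of reads
samples <- list(s1 = c("ACGT", "ACGT", "TTGC"),
                s2 = c("ACGA", "ACGA"))

# lapply() applies the same function to every element, returning a list
counts_lapply <- lapply(samples, table)

# The equivalent explicit for loop
counts_loop <- list()
for (nm in names(samples)) {
  counts_loop[[nm]] <- table(samples[[nm]])
}

identical(counts_lapply, counts_loop)  # both approaches give the same result
```

    The lapply() version is shorter and avoids manually growing a list, which is why it is the idiomatic choice here.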

    In the next step, sequence variants are inferred using the dereplicated data and the inferred error rates. Using the consensus quality profiles significantly increases DADA2’s accuracy.

    As we’ve mentioned previously, many metabarcoding pipelines use OTUs instead of ASVs. For OTUs, reads that are at least 97% similar are clustered together. There are two arguments for doing this. First, many species have more than one haplotype for the same locus, so if you set the clustering threshold at 100% you are oversplitting, i.e. multiple unique sequences can “belong” to the same species. Second, sequences from the same species might end up differing not because of true biological differences but because of errors introduced while PCR-amplifying the sequences or during sequencing. However, ASVs are not the same thing as OTUs clustered at a 100% threshold; instead, DADA2 uses the information from the quality profiles and the error models to distinguish true biological differences between unique amplicons from differences that are only artifacts of error.
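    To build intuition for the 97% threshold, here is a hypothetical helper (not part of DADA2 or any OTU clustering tool) that computes percent identity between two equal-length sequences:

```r
# Hypothetical helper: percent identity between two equal-length sequences
pct_identity <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  100 * sum(a_chars == b_chars) / length(a_chars)
}

a <- strrep("ACGT", 25)   # a 100 bp sequence
b <- a
substr(b, 1, 1) <- "T"    # introduce a single mismatch

pct_identity(a, b)        # 99: above 97%, so a and b would share one OTU
```

    At a 97% threshold these two sequences collapse into one OTU, while DADA2 would ask whether that single mismatch is better explained by the error model or by real biological variation.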

    # infer sequence variants for the forward reads using the learned error model
    dadaFs <- dada(derepFs,
                   err = errF,          # error model learned for forward reads
                   selfConsist = FALSE, # do not re-estimate error rates here
                   multithread = TRUE)
    
    # infer sequence variants for the reverse reads
    dadaRs <- dada(derepRs,
                   err = errR,
                   selfConsist = FALSE,
                   multithread = TRUE)

    Different sequencing platforms and kits are limited in how long their reads can be. For example, this data set was sequenced on a 2x300 bp MiSeq platform, meaning that each amplicon was sequenced 300 bp in the 5’ to 3’ direction and then 300 bp in the 3’ to 5’ direction. As a result, we can sequence inserts longer than 300 bp, and merging the two reads (forward and reverse) increases our confidence in the reliability of the sequence in the overlapping region.
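    The arithmetic behind the overlap is simple; the example values below are hypothetical, not the parameters of this data set:

```r
# With paired-end sequencing the two reads cover the fragment from opposite
# ends, so the expected overlap is: 2 * read length - fragment length
read_length  <- 300   # one read of a 2x300 bp run (hypothetical)
amplicon_len <- 450   # length of the sequenced fragment (hypothetical)

overlap <- 2 * read_length - amplicon_len
overlap               # 150 bp in the middle is covered by both reads
```

    A longer fragment shrinks the overlap; once the fragment exceeds twice the read length, the reads no longer overlap at all and cannot be merged.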

    Our last step is to merge the forward and reverse reads to reconstruct the full target amplicon. Each forward read is aligned with the reverse complement of its corresponding reverse read; merging requires an overlap of at least 12 bases, and no mismatches are permitted in the overlapping region. Paired reads that do not overlap exactly are removed to reduce spurious output.

    Give it a whirl

    Look up the mergePairs() function in the help pane and comment the code below to indicate what the function as a whole and each argument does.

    mergeASVs <- mergePairs(dadaFs, filtFs, 
                            dadaRs, filtRs,
                            minOverlap = 12,  
                            maxMismatch = 0)

    We can take a look at the results. The function returns a list16. Let’s take a look at the first element.

  • 16 Remember, we can think of a list as an object that is like a shelf, where each compartment holds one element; in this case, each sample is an element.

  • head(mergeASVs[1])
    $S1
                                                                                                                                                                                                                                                                                                                                                                                                                sequence
    1             ATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCCCCTTGGTATTCCGAGGGGCACACCCGTTTGAGTGTCGTGAATACTCTCAACCTTCTTGGTTTCTTTGACCACGAAGGCTTGGACTTTGGAGGTTTTTCTTGCTGGCCTCTTTAGAAGCCAGCTCCTCCTAAATGAATGGGTGGGGTCCGCTTTGCTGATCCTCGACGTGATAAGCATCTCTTCTACGTCTCAGTGTCAGCTCGGAACCCCGCTTTCCAACCGTCTTTGGACAAAGACAATGTTCGAGTTGCGACTCGACCTTACAAACCTTGACCTCAAATCGGGTGAGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    2                                                                                             ATGCGATAAGTAGTGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATCCCGAGGGGCATGCCTGTTCGAGCGTCATTTCACCACTCAAGCCTGGCTTGGTGTTGGGCGACGTCCCCTTTTGGGGACGCGTCTCGAAACGCTCGGCGGCGTGGCACCGGCTTTAAGCGTAGCAGAATCTTTCGCTTTGAAAGTCGGGGCCCCGTCTGCCGGAAGACCTACTCGCAAGGTTGACCTCGGATCAGGCAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    3                                                                                                 ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTATAACCACTCAAGCCCCGGCTTGGTCTTGGGGTTCGCGGTCCGCGGCCCTTAAACTCAGTGGCGGTGCCGTCTGGCTCTAAGCGCAGTAATTCTCTCGCTATAGTGTCTAGGTGGTTGCTTGCCATAATCCCCCAATTTTTTACGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    4                                          ATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCACTCCTTGGTATTCCGAGGAGTATGCCTGTTTCAGTATCATGAGCACTCTCACACCTAACCTTTGGGTTTATGGCGTGGAATTGGAATGCGCCGACTGTCATGGTTGGCCCTTCTAAAATGTAGTTCTTGGCTGTCACCTAATACAGCAGTTTGGCCTAATAGTTTTGGCATTCATTGTCAAATCTTTGGCTAACATTTGCTCCAGGAGTCAGTCTTGATAATACAGAAAACTCATTCAAATTTTGATCTGAAATCAGGTAGGGCTACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    5          ATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATTGAATCTTTGAACGCATCTTGCGCCTCTTGGTATTCCGAGGGGCATGCCTGTTTGAGTGTCATTAGAACTATCAAAAAAATAGATGATTTCAATCGTTAATTTTTTTGGAATTGGAGGTGGTGCTGGTCTTTTTCCATTAATGGCCCAAGCTCCTCCGAAATGCATTAGCGAATGCAGTGCACTTTTTCTCCTTGCTTTTTCTGGGCATTGATAGTTTACTCTCATGCCCTAAGCTGGTAGGGAGGAAGTCACAGAATGCTTCCCGCTCCTGAATGTAATACAAAACTTGACGATCAAACCCCTCAAATCAGGCAGGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    7                                                                                 ATGCGATAAGTAATGCGAATTGCAGAATTCAGTGAGTCATCGAATCTTTGAACGCATATTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTCGAGCGTCATGATCAACCATCAAGCCTGGCTTGTCGTTGGACCCTGTTGTCTCTGGGCGACAGGTCCGAAAGATAATGACGGTGTCATGGCAACCCCGAATGCAACGAGCTTTTTTATAGGCACGCATTTAGTGGTTGGCAAGGCCCCCTCGTGCGTTATTATTTTCTTACGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    8                                                                                            ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTGCAACCCTCAAGCATTGCTTGGTATTGGGCTCCGCTGCTCACCCAGCGGGCCTTAAAATCAGTGGCGGTGCCGTCGAGGCCCTGAGCGTAGTAAATATCCTCGCTATAGGGACTCGGTGGACGCTGGCCATTAACCCCCAACTTTCTAAGTTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    9                                                                                            ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCACCCTCTGGTATTCCAGGGGGTATGCCTGTTCGAGCGTCATTACAACCCTCAAGCACTGCTTGGTATTGGATGTCAACCATTGGTGGTGCATCTCAAAAGTATTGGCAGTAGCATTTAGCTTCTAGTGTAGTAAATTTCTCGCTTTGGAGTCAAGTGTCTAATTGCTAGATAGAACCCCTAATTTATCAAAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    10                                                                                                   ATGCGATAAGTAATGCGAATTGCAGAATTCAGTGAGTCATCGAATCTTTGAACGCACATTGCGCCCTGTGGTATTCCGCAGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAGCTCTGCTTGGTGTTGGGCCCCGCCCGCTCGCGGCCGGCCCTAAAGACAGTGGCGGCAGCGTCTGGCTCCAAGCGTAGTACAATCCTCGCTCTGGTGCTAGGCGGTGGCCTGCCAGAACCCCCCTTCTATGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    11                                                                                               ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCACCTTCTGGTATTCTGGGAGGTATGCCTGTTCGAGCGTCATTGCAACCATCAAGCCTAGGCTTGGTATTGGATGCCACCGCTTGGTGCATTTCAAAATTAGTGGCGGTGCCATTCAGCTTCAAGCGTAGTAAATTTCTCGCTCCTGGAGTTTGTATGTTGTCTGCTAGAACCCCCTAATTTATCAAGGTTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    12                                                                                                                       GCGCGATAGGTATTGTGAATTGCAGAATTGTGAATCATCGAATTTTTGAACGCACATTGCACCCATTGGTATTCCGATGGGTATACTTGTTTGAGCGTCATTTCATTCTCCTTTTGGGTTTTGGCATGAATATTTCTTGCTGAATTATAATGGTGTGGCTACCAGACTACAACGTGATAGATATTTCGTTGGATGTGACTGGGATTGCTCACCTTAAAAACATTGTATAGACCTCAAATCAAGCAGGATTACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    13 ATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATATTGCGCTCTTTGGTATTCCGAAGAGCATGCTTGTTTGAGTATCAGTAAACACCTCAACCTCCTCTTGTTTTTTCAAAAGGAGGGTGGACTTGAGCTATCCCAACAACCTTCACCGGTAGGCGGGCGGCTTGAAATGCAGGTGCAGCTGGACTTTTATCTGAGCTAAAAGCATATCTATTTAGTCCTCGTCAAACAGGATTATTACTATTGCTGCAGCTAACATAAAGGATAATTGTCCTCATTGCTGACTGATGCAGGATTTTACGACACTTTATGTGTTGTTCAACTCGATCTCAAATCAAGTAAGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    14                                                                             ATGCGATAAGTAATGCGAATTGCAGAATTCAGTGAGTCATCGAATCTTTGAACGCATATTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTCGAGCGTCATTATCAACCATCAAGCCTGGCTTGTCGTTGGACCTCTTTGCCAATGAAATATGTGGCAGGTCCGAAAGATAATGACGGCGTCGTGTTTGACCCTAGATGCAACGAGCTTTTTATAGCACGCATTGATGTGGTCGGGCGACCCAGTCTTTAACCATTATTTTCTAAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    15                                                                                 ATGCGATAAGTAGTGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCTCCTGGTATTCCGGGAGGCATGCCTGTTCGAGCGTCATTAAAGACCACTCAAGCGATTTTGCTTGGTATTGGAAGAAGAGTGCCTCTGGCCCTCCCTTCCGAAATCCAATGGCGGAAAGTCTCACGTGCCCCGGCGTAGTAAGTTTATCTTTCGCTTGGACCCTGAGGCGTTCTCGCCCTCAAATCCCCAATACTATAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    16                TTGCGATATGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCATATTGCACCCTCTGGTCCATTCCAGAGGGTATGCCTGTTTGAGTGTCATTAATGTATCAAACCACCAAGCTTGCTTGGTTGGTCTTGGATGTTGAGGGTTGCTGGGGTTATAATGATCAGCTCCCTTTAAATGCATTAGCTTGGAATGTATAAGCCATTTTAGCTTAGGCTGATATGAATACAGCGTATTAAATGCTTTTGCTAAAGTGTAGCTTGTCTGGGCTTATAACTGTCTCTAGCTGAGACTGTCTTTTGACATTGTTAAATCATGATCATGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    17                                TTGCGATATGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCACCCTTTGGTATTCCGAAGGGTATGCCTGTTTGAGTGTCATTAAATTCTCAACTTCAAATTGAATTTTGAAGCTTGGACTTTGGAGGTTTGCTGGTGTCACTATCGGCTCCTCTTAAATTCATTAGCGGAACTGTAAGGACCGGCTTTGGTTTGATAGCTAACATTATCTATGCCGTTGCTGTGACCTTTGTGTTTGGCTTCTAATGGTCATTTTGTTGACTGTCTCTGCTTTGAGGCATACACTTTTAAGCTTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    18                                                                                                    ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGGTGGTATTCCGCCGGGCATGCCTATTCGAGCGTCATTACAACCCTCACGCCCCGCGTGGTCTTGGGCCGAGCCCCCCGGGCTGGCCTCAAAAGCAGTGGCGGTGCCTCTGGGTCCTGAGCGTAGTAACACTTCCGCTACAGGGCTCCCGAGCGTGCTGGCCGAACCCCAACCCTTCAGGTTGACCTCGGATTAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    19                                                                                  ATGCGATAAGTAATGCGAATTGCAGAATTCAGTGAGTCATCGAATCTTTGAACGCATATTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTCGAGCGTCATTATCACAGTATCAAGCTTGGCTTGTCGTTGGGCCCTTTGTCACCTGGTGACAGGTCCCAAAGAGAATGACTGGTGTCGTAAAGACTCTAAATGCAACGAGCTTATAACAGCACGCATCTAGTAGTAATATGGCCCGGTTCTCACCTCTTTATTTCTCAAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    21                                                                                            ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTACAACCCTCAAGCAATGCTTGGTGTTGGGCCGCGCCGCTAACCCGGCGGGCCCTAAAACCAGTGGCGGTGCCGTCGGGCTCTGAGCGTAGTAATTCTTCTCGCTATAGAGCCCCGGCGGATGCTAGCCAGCAACCCCCAATTTTCTATGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    22     ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCTCTTTGGTATTCCGAAGAGCATGCCTGTTTGAGTGTCATGAAAATATCAACCTTGACTTGGGTTTAGTGCTCTTGTCTTGGCTTGGATTTGGCTGTTTGCCGCTCGAAAGAGTCGGCTCAGCTTAAAAGTATTAGCTGGATCTGTCTTTGAGACTTGGTTTGACTTGGCGTAATAAGTTATTTCGCTGAGGACAATCTTCGGATTGGCCGAGTTTCTGGGACGTTTGTCCGCTTTCTAATACAAGTTCTAGCTTGCTAGACATGACTTTTTTATTATCTGGCCTCAAATCAGGTAGGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    23                                                                                                    ATGCGATAAGTAATGCGAATTGCAGAATTCAGTGAGTCATCGAATCTTTGAACGCACATTGCGCCCTGTGGTATTCCGCAGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAGCTCTGCTTGGTGTTGGGCCCCGCCCCCGTGGCCGGCCCCAAAGTCAGTGGCGGTGCCGTCCGGCTCTAAGCGTAGTACATCTCTCGCTCTAGGGTCCCGCGGTGGCCTGCCAGAACCCCAACTTCTATGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    24                                                                                                 ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTGTGGTATTCCGCAGGGCATGCCTGTTCGAGCGTCATTTAACCACTCAAGCCTAGCTTGGTATTGGGGCACGCGGTCTCGCGGCCCTTAAAATCAGTGGCGGCGCCGGTGGGCTCTAAGCGTAGTACATACTCCCGCTATAGAGTTCCCTCGGTGGCTCGCCAGAACCCCTAATTTTTACAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    25                                           ATGCGATACGTAATGTGAATTGCAGATTCAGTGAATCATCGAATCTTTGAACGCATATTGCACTCTTTGGTATTCCGAAGAGTATGCCTGTTTCAGTATCATGAAAAACCTCACAAATTCAATTTTGGCTTTGTGGACTTGAGCATTTTGCGGCTTTGTTGCTGCTGGCTTAAAATATATTTCTTGGATAGCATATTATGGCTTTCGAAACTCGGCTTAATAGTTTTGGCTTTTGGTCAAATCTTTAGCTCTTTTCAAAGTCTTCAAGTTATTCAAAAGTTTTATACGAACACTTTCTCAATTTTGATCTGAAATCAGGTAGGATTACCCGCTGAACTTAAGCATATCAATAAGCGGAG
    26                                                                                               CTGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGGGGGGCATGCCTGTTCGAGCGTCACTTCAACCCTCAAGCTCTGCTTGGTGTTGGGCCCTGCCGGCGACGGCAGGCCTTAAAACCAGTGGCGGCGCCGCTGGGCCCTGAGCGTAGTAATACTCCTCGCTACTGGGCCCCAGCGGATGCCTGCCAGCAAACCCAACTTTCTATGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAAGCATATCAATAAGCGGAG
       abundance forward reverse nmatch nmismatch nindel prefer accept
    1        172       1       1    150         0      0      2   TRUE
    2         47       2       4    230         0      0      1   TRUE
    3         43       3       3    234         0      0      2   TRUE
    4         34       5       5    179         0      0      2   TRUE
    5         32       6       2    147         0      0      2   TRUE
    7         21       4       6    218         0      0      1   TRUE
    8         17       8       9    229         0      0      2   TRUE
    9         17       9       7    229         0      0      2   TRUE
    10        13       7      10    237         0      0      1   TRUE
    11        12      20      21    233         0      0      1   TRUE
    12        12      10       8    257         0      0      2   TRUE
    13        10      11      12    139         0      0      1   TRUE
    14         9      19      20    215         0      0      1   TRUE
    15         9      15      13    219         0      0      2   TRUE
    16         6      17      22    154         0      0      1   TRUE
    17         6      16      23    170         0      0      1   TRUE
    18         6      13      16    238         0      0      1   TRUE
    19         5      23      27    220         0      0      1   TRUE
    21         4      26      25    230         0      0      2   TRUE
    22         4      14      11    143         0      0      1   TRUE
    23         4      22      26    238         0      0      1   TRUE
    24         3      21      24    235         0      0      1   TRUE
    25         3      18      15    181         0      0      2   TRUE
    26         2      25      30    233         0      0      1   TRUE

    This is a dataframe where, for each merged sequence (ASV) in that sample, we get the sequence itself, the number of reads corresponding to that forward/reverse combination (abundance), the length of the overlap (nmatch), and other merging statistics.

    For example we can query the largest and smallest overlap like this.

    # Largest overlap 
    max(mergeASVs[[1]]$nmatch) 
    [1] 257
    # Smallest overlap
    min(mergeASVs[[1]]$nmatch) 
    [1] 139

    12.2 Create a sequence table

    Now we have a fully denoised set of sequences that can be used to generate a sequence table for further analysis. This table contains the merged sequences, their abundances, and the indices of the forward and reverse sequence variants that were merged.

    seqtab <- makeSequenceTable(mergeASVs)

    Let’s take a quick look at what this data set looks like.

    dim(seqtab)
    [1]  74 701
    seqtab[,1]
     S1 S10 S11 S12 S13 S14 S15 S16 S17 S18  S2 S20 S21 S22 S23 S24 S25 S26 S27 S28 
      0 218   0   0 124   0  48   0   0   0   0   0   0   0   0   0   0   0   0   0 
    S29  S3 S30 S31 S33 S34 S35 S36 S37 S39  S4 S41 S42 S43 S44 S45 S46 S47 S48 S49 
      0   0   0   0   0   0  19   0 148   0   0   0   0  70   0   0   0   0 113  32 
     S5 S50 S51 S52 S53 S54 S55 S56 S58 S59  S6 S60 S61 S62 S63 S64 S65 S66 S67 S68 
      0   0   0  46   0 117   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
    S69  S7 S72 S73 S74 S75 S76 S77 S78 S79  S8 S80 S81  S9 
      0   0   0   0   0 234   0   0 109   0  43   0   0  40 

    We can see that it has 74 rows (each sample is a row) and 701 columns (each ASV is a column). If we print just the first column (ASV), you can see that a sequence table records, for each sample, the number of times that ASV was observed.
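    A miniature mock-up makes that structure easy to see. The matrix below is hypothetical and only mimics the shape of the real seqtab object:

```r
# Toy sequence table: one row per sample, one column per ASV,
# entries are the number of reads of each ASV observed in each sample
seqtab_toy <- matrix(c(10, 0, 3,
                        0, 5, 2),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("S1", "S2"),
                                     c("ASV1", "ASV2", "ASV3")))

dim(seqtab_toy)    # 2 samples, 3 ASVs
seqtab_toy[, 1]    # counts of the first ASV across both samples
```

    In the real table the column names are the ASV sequences themselves rather than labels like "ASV1".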

    12.3 Remove chimeras

    The DADA2 algorithm accounts for indel errors and substitutions when inferring ASVs, but before taxonomic assignment we also need to check for chimeras17. DADA2 identifies likely chimeras by aligning each sequence with sets of sequences that were recovered at higher abundance and then determining whether any lower-abundance sequence can be reconstructed by combining the left segment of one more abundant sequence with the right segment of another.

  • 17 Chimeras are non-biological sequences where the left and right segments of the merged sequence come from two or more parent sequences.

  • We can use the function removeBimeraDenovo() to identify and remove chimeras and then create an updated sequence table.

    seqtab.nochim <- removeBimeraDenovo(seqtab,
                                        method = "consensus", 
                                        multithread = FALSE, 
                                        verbose = TRUE)

    After removing chimeras, the data set consists of 700 unique sequences across 74 samples.

    Even though chimeric sequences can frequently make up a large fraction of sequence variants, initially making a data set seem more variable than it is, they should account for only a very small share of the merged reads once abundance is taken into account. Here only 0.1% of merged sequence variants are chimeric; weighted by the abundance of these variants, 99.9% of merged reads are non-chimeric.
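    A toy construction shows what a chimera looks like. The parent sequences below are made up, and real chimera detection works on aligned, abundance-ranked sequences rather than fixed split points:

```r
# A chimera joins the left part of one parent sequence to the right part of
# another, typically when an aborted PCR product primes the wrong template
parentA <- "AAAAAAAAAA"
parentB <- "GGGGGGGGGG"

chimera <- paste0(substr(parentA, 1, 5), substr(parentB, 6, 10))
chimera   # "AAAAAGGGGG": matches neither parent but is built from both
```

    This is why chimeras cannot be caught by the error model alone: every base in the chimera is a faithfully sequenced copy of some real parent.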

    Let’s take a look at the distribution of our non-chimeric ASV lengths.

    Figure 12.1: Distribution of sequence length for merged ASVs. The red dotted line indicates sequences that are 100-105bp long. We are targeting a 106 bp amplicon and are using 2x150 bp sequencing. Primer sites are approx 25bp each.

    12.4 Summary of read filtering & processing for QC

    At this point, we will want to take a look at how many reads were lost at each step to determine whether those patterns look as expected. We can pull that information from the various output files by counting the reads and putting them all in a dataframe.

    # custom function to get read numbers
    getN <- function(x) sum(getUniques(x))
    
    # create data table with number of reads per sample at each step
    track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergeASVs, getN), rowSums(seqtab.nochim)) %>%
      as.data.frame() %>%
      rownames_to_column("sample") %>%
      rename(trimmed = reads.out,
             denoisedF = V3,
             denoisedR = V4,
             merged = V5,
             rm_chimera = V6) %>%
      pivot_longer(names_to = "filter", values_to = "reads", cols = 2:7) %>%
      mutate(filter = ordered(filter, levels = c("reads.in", "trimmed", "denoisedF", "denoisedR", "merged", "rm_chimera")),
             k_reads = reads/1000)
    
    # plot distribution
    ggplot(track, aes(x = filter, y = k_reads)) +
      geom_boxplot(fill = "darkorange") +
      labs(x = "filter/processing step", y = "thousand reads per sample") +
      theme_standard

    Figure 12.2: Comparison of change in the number of reads per sample at each filtering & processing stage. Red dotted line indicates targeted 150k reads per sample.
    Consider this

    Take a look at the boxplot and describe how the different processing steps have impacted the number of reads remaining in the data set.

    Remember that this is a modified data set where each sample has been subsampled to contain 1000 reads - normally that initial box describing the distribution of samples would be wider as different samples would contain different numbers of reads. Generally, researchers target > 75,000 - 150,000 reads per sample to make sure that all taxa present can be recovered in the data set.

    Did it!

    [Your answer here]

    Outside of the initial filtering there should not be any step at which a substantial number of reads is lost; the majority of reads should merge. If this is not the case, trimming was likely too conservative and should be revisited. Similarly, only a small proportion should be chimeric. If primers are not completely removed, ambiguous nucleotides can interfere with chimera identification, and that quality-filtering step should be revisited.
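    The kind of read tracking done above can be sketched in base R with hypothetical counts (the real track dataframe uses the pipeline’s objects and tidyverse reshaping):

```r
# Minimal read-tracking table with made-up counts for two samples
track_toy <- data.frame(
  sample   = c("S1", "S2"),
  reads.in = c(1000, 1000),
  trimmed  = c(950, 930),
  merged   = c(900, 880)
)

# percent of input reads surviving to the merged step
round(100 * track_toy$merged / track_toy$reads.in, 1)   # 90.0 88.0
```

    Per-step percentages like these make it easy to spot the step at which an unexpectedly large share of reads disappears.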

    12.5 Taxonomic classification for ASVs present in each sample.

    We are now in the final stretch. Now that we have our ASVs and our sequence table the only thing that we still need to do is figure out which species (or other taxonomic groups) our ASVs correspond to.

    This brings us to an important limitation of metabarcoding studies which is that our results are only ever as good as our reference database to which we can match our ASVs.

    Consider this

    Our next step is to match the sequences to a list of reference sequences with known taxonomic information; in our case study these are sequenced fungi. Describe your expectations of the results: do you expect every ASV in the data set to find a match in the reference? What other issues could make the results ambiguous?

    Did it!

    [Your answer here]

    DADA2 has a built-in naive Bayesian classification method (assignTaxonomy()) that takes the sequences to be classified as input, along with a fasta file that contains the reference sequences18.

  • 18 fasta files consist of a header line that starts with > and can contain any information about the sequence; the sequence itself is on the following line. A multifasta file can contain multiple sequences; the program reading the file determines the start of a new sequence when it “sees” the > of the next header line.

  • Depending on the project, scientists will pull sequences available from public databases such as GenBank, or, if you are working with species that do not have many sequences available, you would need to create your own database by sampling individuals representing the different species/taxonomic groups you expect to encounter and sequencing them.
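    The fasta format described in the footnote above can be parsed with a few lines of base R. This is only a sketch on an in-memory example; it assumes each record’s sequence sits on a single line, which real parsers do not require:

```r
# Minimal multifasta parser sketch (hypothetical in-memory file)
fasta_lines <- c(">seq1 an example record", "ACGTACGT",
                 ">seq2 another record",    "TTGGCCAA")

is_header <- grepl("^>", fasta_lines)              # headers start with ">"
ids  <- sub("^>([^ ]+).*", "\\1", fasta_lines[is_header])  # first token as ID
seqs <- fasta_lines[!is_header]
names(seqs) <- ids
seqs
```

    Real reference files are read with dedicated functions, but this shows how the > marker delimits records.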

    We are using the UNITE database of references which is designed to gather available ITS sequences to identify Eukaryotes.

    Running this function is going to take several minutes, depending on the speed of your computer. Note that this code chunk has the option eval set to false so that it won’t run. Instead, you can see that using the function save() I have written the object out to your results folder. We can then load it into your R environment in the next code chunk using load().

    taxotab <- assignTaxonomy(seqtab.nochim,
                              refFasta = "data/sh_general_release_dynamic_25.07.2023.fasta",
                              minBoot = 50, 
                              multithread = FALSE)
    
    save(taxotab, 
         file = "results/taxotab.rds")

    We can take a look at our results… there is a lot of information in this table, so let’s just pull the first 5 ASVs’ taxonomy without the ASV sequences for a better viewing experience.

    # load object into environment
    load("results/taxotab.rds")
    
    # look at first 5 ASVs' taxonomy without the ASV sequence
    write.table(taxotab[1:5,], row.names = FALSE)
    "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
    "k__Plantae" NA NA NA NA NA NA
    "k__Fungi" "p__Basidiomycota" "c__Agaricomycetes" "o__Agaricales" "f__Tricholomataceae" "g__Mycena" NA
    "k__Fungi" "p__Mucoromycota" "c__Umbelopsidomycetes" "o__Umbelopsidales" "f__Umbelopsidaceae" "g__Umbelopsis" "s__dimorpha"
    "k__Fungi" "p__Ascomycota" "c__Sordariomycetes" "o__Xylariales" "f__Amphisphaeriaceae" "g__Polyscytalum" "s__algarvense"
    "k__Fungi" "p__Ascomycota" "c__Dothideomycetes" "o__Mytilinidales" "f__Gloniaceae" "g__Cenococcum" "s__geophilum"
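    Notice that some ASVs (like the first one above) lack assignments below a certain rank. As a toy sketch of how you might quantify that, using a hypothetical matrix that only mirrors the structure of the assignTaxonomy() output:

```r
# Toy taxonomy matrix (hypothetical): one row per ASV, one column per rank
taxo_toy <- matrix(c("k__Fungi", "p__Ascomycota", "s__geophilum",
                     "k__Fungi", NA,              NA),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(NULL, c("Kingdom", "Phylum", "Species")))

# fraction of ASVs left without a species-level assignment
mean(is.na(taxo_toy[, "Species"]))   # 0.5
```

    The same column-indexing pattern applied to the real taxotab object reveals how far down the ranks each ASV could be classified.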

    We can also pull the unique species identified.

    unique(unname(taxotab[,7])) 
      [1] NA                    "s__dimorpha"         "s__algarvense"      
      [4] "s__geophilum"        "s__flavescens"       "s__terricola"       
      [7] "s__elongatum"        "s__punicea"          "s__sylvestris"      
     [10] "s__humilis"          "s__subvinosa"        "s__opacum"          
     [13] "s__abramsii"         "s__ericae"           "s__terminalis"      
     [16] "s__variata"          "s__rexiana"          "s__zollingeri"      
     [19] "s__deciduus"         "s__album"            "s__chlamydosporicum"
     [22] "s__saponaceum"       "s__rufescens"        "s__populi"          
     [25] "s__podzolica"        "s__reidii"           "s__asperellum"      
     [28] "s__microspora"       "s__simile"           "s__acicola"         
     [31] "s__subsulphurea"     "s__camphoratus"      "s__pseudozygospora" 
     [34] "s__heterochroma"     "s__brunneoviolacea"  "s__verrucosa"       
     [37] "s__auratus"          "s__mutabilis"        "s__chlorophana"     
     [40] "s__fuckelii"         "s__miniata"          "s__lignicola"       
     [43] "s__pilicola"         "s__phyllophila"      "s__australis"       
     [46] "s__citrina"          "s__fragilis"         "s__conica"          
     [49] "s__lubrica"          "s__pygmaeum"         "s__isabellina"      
     [52] "s__var._bulbopilosa" "s__finlandica"       "s__echinulatum"     
     [55] "s__lacmus"           "s__trabinellum"      "s__reginae"         
     [58] "s__spadicea"         "s__myriocarpa"       "s__physaroides"     
     [61] "s__calyptrata"       "s__nigrella"         "s__carneum"         
     [64] "s__vagans"           "s__metachroides"     "s__fumosa"          
     [67] "s__cantharellus"     "s__laetior"          "s__fusiformis"      
     [70] "s__spirale"          "s__pullulans"        "s__crocea"          
     [73] "s__sublilacina"      "s__acerinum"         "s__macrocystis"     
     [76] "s__vrijmoediae"      "s__changbaiensis"    "s__cygneicollum"    
     [79] "s__hymenocystis"     "s__dioscoreae"       "s__alnicola"        
     [82] "s__difforme"         "s__bicolor"          "s__spurius"         
     [85] "s__griseoviride"     "s__rebaudengoi"      "s__rufum"           
     [88] "s__globulifera"      "s__skinneri"         "s__sindonia"        
     [91] "s__verhagenii"       "s__maius"            "s__anomalovelatus"  
     [94] "s__diversispora"     "s__fellea"           "s__splendens"       
     [97] "s__coccinea"         "s__nitrata"          "s__risigallina"     
    [100] "s__juniperi"         "s__columbetta"       "s__rhododendri"     
    [103] "s__cinereus"         "s__fusispora"        "s__scaurus"         
    [106] "s__soppittii"        "s__grovesii"         "s__atropurpureum"   
    [109] "s__renispora"        "s__pura"             "s__foliicola"       
    [112] "s__phaeococcinea"    "s__rosea"            "s__stuposa"         
    [115] "s__minima"           "s__atrovirens"       "s__canadensis"      
    [118] "s__silvestris"       "s__sepiacea"         "s__pyriforme"       
    [121] "s__bulbillosa"       "s__glutinosum"       "s__cylichnium"      
    [124] "s__aeria"            "s__veluwensis"       "s__epicalamia"      
    [127] "s__hyalina"          "s__cylindrica"       "s__miyabei"         
    [130] "s__terrestris"       "s__rimosissimus"     "s__acuta"           
    [133] "s__myxotrichoides"   "s__physodes"         "s__alpina"          
    [136] "s__fallax"           "s__fumosibrunneus"   "s__albicastaneus"   
    [139] "s__mors-panacis"     "s__glacialis"        "s__acerina"         
    [142] "s__flavidum"         "s__ocularis"         "s__exigua"          
    [145] "s__piceae"           "s__verzuoliana"      "s__alliacea"        
    [148] "s__entomopaga"       "s__hyalocuspica"     "s__umbrosum"        
    [151] "s__bombacina"        "s__boeremae"         "s__fortinii"        
    [154] "s__miyagiana"       
    Give it a whirl

    Use your coding skills to additionally pull the unique genera, families, and orders contained in the data set and describe our results.

    Did it!

    [Your answer here]

    12.6 From the beginning …

    Consider this

    Now that we have been through all the steps of processing a next-generation sequencing data set in a metabarcoding pipeline, go back over the entire process, make sure you understand what the key steps are, and then outline that process below. Be sure to list the key steps in an organized manner and describe what occurs at each step in 2-3 sentences.

    Did it!

    [Your answer here]

    12.7 Acknowledgements

    This chapter borrows heavily from the original DADA2 tutorial as well as Alexis Carteron & Simon Morvan’s tutorial, based on a subset of the sequences from (Carteron et al. 2021).

    Carteron, Alexis, Marie Beigas, Simon Joly, Benjamin L. Turner, and Etienne Laliberté. 2021. “Temperate Forests Dominated by Arbuscular or Ectomycorrhizal Fungi Are Characterized by Strong Shifts from Saprotrophic to Mycorrhizal Fungi with Increasing Soil Depth.” Microbial Ecology 82 (2): 377–90. https://doi.org/10.1007/s00248-020-01540-7.